Abstract

Identifying overlapping communities in networks is a challenging task. In this work we present a novel 
approach to community detection that utilises the Bayesian non-negative matrix factorisation (NMF) model 
to produce a probabilistic output for node memberships. The scheme has the advantage of computational 
efficiency, soft community membership and an intuitive foundation. We present the performance of the 
method against a variety of benchmark problems and compare and contrast it to several other algorithms 
for community detection. Our approach performs favourably compared to other methods at a fraction of the
computational cost.

Keywords: Community detection, non-negative matrix factorisation, Bayesian inference. 

1 Introduction 

The network paradigm is widely used to model real-world complex systems, by focusing on the pattern of
associations between their structural components. A system is captured as a mathematical graph, where nodes (or
vertices) denote the presence of an entity and edges (or links) signify some sort of association (or interaction).
In contrast to other data manipulation approaches where each element is described by a set of attributes (for
example $x \in \mathbb{R}^D$), here our data is captured in a relational form and inferences are made primarily based on
their connectivity patterns.

Real-world networks differ from the classic Erdős-Rényi random graph because the presence of a link between
two nodes is not generated by a Bernoulli trial with the same success probability across all possible pairs.
Instead, real-world networks exhibit an inhomogeneous distribution of edges among vertices [12], creating
'hotspots' of heightened connectivity. These modules or communities are densely connected, relatively independent
compartments [12] [20] of the network that account for its form and function as a system [26]. This
mesoscopic organisation of networks is intuitively straightforward, with many examples
from everyday life: human social networks consist of cliques of friends, Web pages can be grouped into collections
with similar topics, etc.

As an example, consider the simple undirected graph of Fig. 1 described by an adjacency matrix
$A \in \mathbb{R}^{N \times N}$, where we can immediately distinguish the two densely connected compartments C1 and C2. As we
are not sure of the membership of node 5, it is fairly reasonable to consider it as an overlap of C1 and C2.
Therefore, we can express our community partition as an expansion of our network to a bipartite graph with
incidence matrix $B \in \mathbb{R}^{N \times K}$ so that $b_{ik} = 1$ denotes that node i belongs to group k and is zero otherwise.
Although the human eye is an excellent analytic tool for simple visualised data [24], the algorithmic process of
identifying the number of groups, classifying their members and spotting overlaps in any given network is far
from straightforward.
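
The bipartite representation can be illustrated with a minimal sketch. The membership lists below are hypothetical (they are not read from Fig. 1); they only assume a seven-node graph in which node 5 belongs to both C1 and C2. The function name `incidence_matrix` is likewise an illustrative choice.

```python
# Hypothetical memberships for illustration only (not taken from Fig. 1).
communities = {
    "C1": [1, 2, 3, 4, 5],
    "C2": [5, 6, 7],
}

def incidence_matrix(n_nodes, communities):
    """Return B (n_nodes x K) with B[i][k] = 1 if node i+1 belongs to community k."""
    names = sorted(communities)
    B = [[0] * len(names) for _ in range(n_nodes)]
    for k, name in enumerate(names):
        for node in communities[name]:
            B[node - 1][k] = 1  # nodes are numbered from 1, rows from 0
    return B

B = incidence_matrix(7, communities)
# Nodes whose row of B has more than one non-zero entry are overlaps.
overlapping = [i + 1 for i, row in enumerate(B) if sum(row) > 1]
```

A row of B with several ones marks an overlapping node; the method developed in this paper replaces these binary entries with graded memberships.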

Figure 1: The problem of community detection can be viewed as mapping our network to a bipartite graph,
where the first class corresponds to individual nodes and the second to communities. Links connecting nodes to 
multiple communities help us capture overlapping phenomena, in which an individual can participate in many 
groups. 



Extracting community structure from a network is a considerably challenging task [26], both as an inference
and as a computational efficiency problem. One of the main reasons is that there is no formal, application-independent
definition of what a community actually is [12]: we simply accept that communities are node
subsets with a 'statistically surprising' link density, usually measured by the 'modularity' Q of Mark Newman and
Michelle Girvan [27]. Nevertheless, this definition lies at the heart of modern community detection algorithms,
manifested either as an exploration of configurations (such as B described above) that seek to approximately
maximise Q (as direct optimisation is NP-hard [12]), or as other techniques that exploit characteristics of the
network topology to group together nodes with high mutual connectivity (and use Q as a performance measure).
A fairly comprehensive review of existing methods is provided in [30] and [12], while [?] provides a table
summarising the computational complexity of popular algorithms.

The most significant drawbacks of modern community detection algorithms are hard partitioning and/or
computational complexity. Many widely popular methods such as Girvan-Newman [?], Extremal Optimisation
[10] or Spectral Partitioning [25] cannot account for the overlapping nature of communities, which is an
important characteristic of complex systems [12]. On the other hand, the Clique Percolation Method (CPM)
[28] allows node assignment to multiple modules but does not provide some 'degree of belief' on how strongly
an individual belongs to a certain group. Therefore, our aim is not only to model group overlaps but also to capture
the likelihood of node memberships in a disciplined way, expressing each row of the incidence matrix B
as a membership distribution over communities. Additionally, we want to avoid the computational complexity
issues of traditional combinatorial methods.

Towards these goals, community detection can be seen as a generative model in a probabilistic framework. 
This has the advantage that, in principle, fully Bayesian models may be formulated in which priors exist over 
all the model parameters. This enables, for example, model selection (to determine e.g. how many communities 
there are) and the principled handling of uncertainty, noise and missing data. 

In this paper we consider the case in which constraints exist in our beliefs regarding the generative process 
for the observed data. In particular we investigate the issue of enforcing positivity onto the model. As the 
model we consider is evaluated in a Bayesian manner, model selection may be applied to infer the complexity 
of representation in the solution space of inferred communities. 

This paper is organised as follows. We first introduce the basic concepts of the theory and our data decomposition
goals. Details of the Bayesian paradigm under which model inference is performed are then presented.
Representative results are given in the next sections followed by conclusions and discussion.



2 Methods 

We consider a matrix of observed interactions between a set of N atoms, which in the context of community
detection we consider to be individuals. We consider an interaction matrix denoted $V \in \mathbb{R}_+^{N \times N}$ such that $v_{ij}$
represents a count process detailing the number of interactions between atoms (individuals) i and j.$^1$



2.1 Decomposition 

We consider the decomposition of the observed interaction matrix V as a linear combination of K canonical
communities, each of which can be seen as a latent (hidden) generator of interactions between atoms. Hence,

$$v_{ij} = \sum_{k=1}^{K} w_{ik} h_{kj}, \qquad (1)$$

in which we regard the $w_{ik}$ as mixing coefficients and the $h_{kj}$ as elements forming a basis set of community
structures. The above equation may be re-written in matrix form by defining $W \in \mathbb{R}^{N \times K}$ and $H \in \mathbb{R}^{K \times N}$:

$$V = WH. \qquad (2)$$

Without constraint, Equation 2 is ill-posed, i.e. an infinite number of equivalent solutions exist. Many well-known
decomposition methods can be seen as members of a family of approaches which impose constraints in
the solution space to make the matrix-product decomposition well-posed.
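
The ill-posedness can be seen in a small numerical sketch: for any invertible matrix D, the pair (WD, D⁻¹H) gives exactly the same product as (W, H). The matrices below are arbitrary illustrative values, and a diagonal D is used only to keep the inverse obvious.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Arbitrary illustrative factors: W is N x K, H is K x N (N = 3, K = 2).
W = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0]]
H = [[1.0, 2.0, 0.0], [0.0, 1.0, 4.0]]

# An invertible rescaling of the K components and its inverse.
D = [[2.0, 0.0], [0.0, 0.5]]
D_inv = [[0.5, 0.0], [0.0, 2.0]]

V1 = matmul(W, H)
V2 = matmul(matmul(W, D), matmul(D_inv, H))  # (W D)(D^-1 H) = W H
```

Here `V1` and `V2` coincide although the factors differ, which is why additional constraints are needed to single out a solution.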



PCA: Principal Component Analysis [17] avoids the problem of an ill-posed solution space by imposing several
constraints. The first is that the observed data distribution and the basis are Gaussian distributed (in that only
second-order statistics are employed in the PCA formalism); the second is that the basis is orthogonal. This still leaves a
permutation degree of freedom, which is removed by sorting the basis in order of variance (in effect an ordering
of the eigenvalues).

$^1$ A brief note on nomenclature: as the method described here has its roots in the work on non-negative matrix factorization (NMF),
we keep to the historic nomenclature of the latter to make access easier for the reader who wishes to follow up primary references.

ICA: Independent Component Analysis [?] [31] makes the solutions to Equation 2 well-posed by forcing
statistical independence between the components of the basis without making strong assumptions regarding
the Gaussianity of the component distributions, unlike PCA. This gives rise to a methodology which allows
projective decompositions similar to PCA for non-Gaussian data.

NMF: Non-negative Matrix Factorization makes the assumption that all the elements of matrices V, W, H
lie in $\mathbb{R}_+$. This is enough, up to an arbitrary scaling degree of freedom, to ensure that solutions to Equation
2 are well-posed. The latter assumptions of non-negativity match our prior beliefs regarding the generation of
the observed interaction count data, V, and we extend our discussion of NMF in the following sections.

2.2 Generative Model 

We consider the observed data to be modelled as a Poisson process with expectations given by the elements of
$\hat{V} = WH$. This Poisson model lies at the core of the NMF methodology first developed by Lee & Seung [21].
The standard maximum-likelihood solution is to find W, H such that $p(V \mid W, H)$ is maximized, or alternatively,
the energy function $-\log p(V \mid W, H)$ is minimized.

Consider an element v of V and the associated expected counts from the model, $\hat{v}$. The negative log
likelihood of v under a Poisson model is:

$$-\log p(v \mid \hat{v}) = -v \log \hat{v} + \hat{v} + \log v!. \qquad (3)$$

Using the Stirling approximation to second order, namely

$$\log v! \approx v \log v - v + \frac{1}{2} \log(2\pi v), \qquad (4)$$

Equation 3 can be written as

$$-\log p(v \mid \hat{v}) \approx v \log\left(\frac{v}{\hat{v}}\right) + \hat{v} - v + \frac{1}{2} \log(2\pi v), \qquad (5)$$

and the full negative log-likelihood for all the observed data as

$$-\log p(V \mid \hat{V}) = -\sum_{i} \sum_{j} \log p(v_{ij} \mid \hat{v}_{ij}). \qquad (6)$$
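
As a rough numerical check of Equations 3 to 6, the sketch below compares the exact Poisson negative log-likelihood with its Stirling form. The function names are illustrative, and the Stirling version assumes v > 0 (a zero count would need the usual convention 0 log 0 = 0, which is not handled here).

```python
import math

def poisson_nll_exact(v, v_hat):
    """Equation 3: -log p(v | v_hat), using lgamma(v + 1) = log v!."""
    return -v * math.log(v_hat) + v_hat + math.lgamma(v + 1)

def poisson_nll_stirling(v, v_hat):
    """Equation 5: the same quantity with the Stirling approximation (needs v > 0)."""
    return v * math.log(v / v_hat) + v_hat - v + 0.5 * math.log(2 * math.pi * v)

def total_nll(V, V_hat, nll=poisson_nll_stirling):
    """Equation 6: sum of the element-wise negative log-likelihoods."""
    return sum(nll(v, vh)
               for row, row_hat in zip(V, V_hat)
               for v, vh in zip(row, row_hat))

# The gap between the two forms shrinks roughly like 1 / (12 v).
gap_small = abs(poisson_nll_exact(1, 1.5) - poisson_nll_stirling(1, 1.5))
gap_large = abs(poisson_nll_exact(10, 8.0) - poisson_nll_stirling(10, 8.0))
```

The approximation error depends only on v, not on the model expectation, so it does not affect which W and H minimise the energy.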

2.2.1 Shrinkage hyperparameters 

As each of the $k \in \{1 \ldots K\}$ columns of W and rows of H represents the contribution from a single latent
community (as per Equation 1), we allow for different shrinkage hyperparameters, defined as the set $\{\beta_k\}$.
Following the development of this model in [33] and similar models for probabilistic PCA [34] and ICA [7] [32],
we place independent half-normal priors over the columns of W and rows of H in which the $\beta_k$ may be seen
as precision (inverse variance) parameters:

$$p(w_{ik} \mid \beta_k) = \mathcal{HN}(w_{ik} \mid 0, \beta_k^{-1}),$$
$$p(h_{kj} \mid \beta_k) = \mathcal{HN}(h_{kj} \mid 0, \beta_k^{-1}), \qquad (7)$$

where

$$\mathcal{HN}(x \mid 0, \beta^{-1}) = \sqrt{\frac{2}{\pi}}\, \beta^{1/2} \exp\left(-\frac{1}{2} \beta x^2\right). \qquad (8)$$

Defining the vector $\beta$ as $[\beta_1, \ldots, \beta_K]$, this leads to negative log priors over W and H as:

$$-\log p(W \mid \beta) = \sum_{i} \sum_{k} \frac{1}{2} \beta_k w_{ik}^2 - \frac{N}{2} \sum_{k} \log \beta_k + \text{const},$$
$$-\log p(H \mid \beta) = \sum_{k} \sum_{j} \frac{1}{2} \beta_k h_{kj}^2 - \frac{N}{2} \sum_{k} \log \beta_k + \text{const}. \qquad (9)$$

The net effect of $\beta$ on the elements of W and H may be considered as follows. As the negative log probability
may be regarded as an error or energy function, our goal is to descend its surface to a point of minimum.
Consider the negative derivative w.r.t. a single element of W or H of the above equations, e.g.

$$-\frac{\partial\left(-\log p(W \mid \beta)\right)}{\partial w_{ik}} = -\beta_k w_{ik}. \qquad (10)$$

Incremental changes in $w_{ik}$, as we iterate towards a solution, are proportional to the negative gradient of the energy
function, so the effect of the prior is to promote a shrinkage to zero of $w_{ik}$ with a rate constant proportional
to $\beta_k$. A large $\beta_k$ represents a belief that the half-normal distribution over $w_{ik}$ has small variance, and hence $w_{ik}$
is expected to lie close to zero. As we shall see, the priors and the likelihood function (quantifying how well we
explain the data) are combined with the net effect that columns of W (and rows of H) which have little effect
in changing how well we explain the observed data will shrink close to zero. This generic approach is well
known in the statistics literature as shrinkage or ridge regression [2] and in the machine learning community
as automatic relevance determination [?].

Finally we must place prior distributions over the $\beta_k$. We assume the set of $\beta_k$ are independent$^2$ and as
these are scale hyperparameters we place a standard Gamma distribution over them [2]:

$$p(\beta_k \mid a_k, b_k) = \frac{b_k^{a_k}}{\Gamma(a_k)}\, \beta_k^{a_k - 1} \exp(-\beta_k b_k), \qquad (11)$$

in which the hyper-hyperparameters $a_k, b_k$ defining the gamma distribution over $\beta_k$ are fixed. The negative log
of the probability distribution over $\beta$ is hence

$$-\log p(\beta) = \sum_{k} \left[ \beta_k b_k - (a_k - 1) \log \beta_k \right] + \text{const}. \qquad (12)$$

2.2.2 Overall posterior cost function 

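Figure 2 shows the generative graphical model for the NMF method. The joint distribution over all variables is

$$p(V, W, H, \beta) = p(V \mid W, H)\, p(W \mid \beta)\, p(H \mid \beta)\, p(\beta), \qquad (13)$$

and the model posterior over all parameters, given the observations, is:

$$p(W, H, \beta \mid V) = \frac{p(V \mid W, H)\, p(W \mid \beta)\, p(H \mid \beta)\, p(\beta)}{p(V)}. \qquad (14)$$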



$^2$ This corresponds to the belief that the existence of one community is not dependent upon others. Clearly, there will be situations in
which this can be extended to allow for a full inter-dependency between communities. We do not consider this here, however. Allowing
dependency is similar to the notion of structure priors discussed in [29].

Figure 2: Graphical model showing the generation of count processes, V, from the latent structure W and H,
the components of which have scale hyperparameters $\beta$. The hyper-hyperparameters a, b are fixed in the model.



Noting that p(V) is a constant w.r.t. the inference over the model's free parameters, we hence aim to maximize
the model posterior given the observations. This is equivalent to minimizing the negative log posterior, which
we may regard as an energy (or error) function, U say. We hence define

$$U = -\log p(V \mid W, H) - \log p(W \mid \beta) - \log p(H \mid \beta) - \log p(\beta). \qquad (15)$$

Expanding this expression using the results from Equations 5, 6, 9 and 12 and collating all terms independent
of the model parameters into a constant, gives:

$$U = \sum_{i} \sum_{j} \left[ v_{ij} \log \frac{v_{ij}}{\hat{v}_{ij}} + \hat{v}_{ij} \right] + \frac{1}{2} \sum_{k} \left[ \beta_k \left( \sum_{i} w_{ik}^2 + \sum_{j} h_{kj}^2 \right) - 2N \log \beta_k \right] + \sum_{k} \left[ \beta_k b_k - (a_k - 1) \log \beta_k \right] + \text{const}. \qquad (16)$$
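
The energy of Equation 16 can be evaluated directly, which is useful for monitoring convergence. The sketch below is an illustration under stated assumptions: matrices are plain lists of rows, the hyper-hyperparameters a and b are shared by all k, the additive constant is dropped, and terms with $v_{ij} = 0$ use the convention 0 log 0 = 0. The function name `energy` is hypothetical.

```python
import math

def energy(V, W, H, beta, a, b):
    """Equation 16 up to an additive constant. V is N x N, W is N x K, H is K x N."""
    N, K = len(W), len(beta)
    U = 0.0
    # Likelihood part: sum_ij [ v_ij log(v_ij / v_hat_ij) + v_hat_ij ].
    for i in range(N):
        for j in range(N):
            v_hat = sum(W[i][k] * H[k][j] for k in range(K))
            if V[i][j] > 0:  # convention: 0 * log 0 = 0
                U += V[i][j] * math.log(V[i][j] / v_hat)
            U += v_hat
    for k in range(K):
        sq = sum(W[i][k] ** 2 for i in range(N)) + sum(h * h for h in H[k])
        # Half-normal priors on column k of W and row k of H.
        U += 0.5 * (beta[k] * sq - 2 * N * math.log(beta[k]))
        # Gamma prior on beta_k.
        U += beta[k] * b - (a - 1) * math.log(beta[k])
    return U

# Tiny check case: N = 2, K = 1, perfect reconstruction of an all-ones V.
U_example = energy([[1, 1], [1, 1]], [[1.0], [1.0]], [[1.0, 1.0]], [1.0], a=1.0, b=2.0)
```

In this example the likelihood part contributes 4, the half-normal priors 2 and the Gamma prior 2.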



2.3 Parameter Inference 



There are a variety of approaches one could take to infer W, H, $\beta$ given Equation 16. In this paper we follow
[21] [22] [3] [33] and utilize a rapid fixed-point maximum a posteriori (MAP) algorithm which is guaranteed to
preserve the non-negativity of all parameters in the model. At each iteration the following re-estimations are
made,

$$H \leftarrow \left( \frac{H}{W^{\top} \mathbf{1} + BH} \right) \odot \left[ W^{\top} \left( \frac{V}{WH} \right) \right], \qquad (17)$$

$$W \leftarrow \left( \frac{W}{\mathbf{1} H^{\top} + WB} \right) \odot \left[ \left( \frac{V}{WH} \right) H^{\top} \right], \qquad (18)$$

in which we define B as having the elements $\beta_k$ along its diagonal and zeros elsewhere, $\odot$ denotes element-by-element
multiplication and $\mathbf{1}$ is a vector
of ones. We keep to the notation convention that $\left(\frac{X}{Y}\right)$ represents element-by-element division and does not
represent $XY^{-1}$. The values of $\beta_k$ are re-estimated by setting to zero the derivative w.r.t. the $\beta_k$ of the energy
function in Equation 16. This gives an estimate for $\beta_k$ as

$$\beta_k \leftarrow \frac{N + a_k - 1}{\frac{1}{2} \left( \sum_{i} w_{ik}^2 + \sum_{j} h_{kj}^2 \right) + b_k}, \qquad (19)$$

which is the same as the update equation detailed in [33]. The algorithm proceeds by cycling through Equations
17, 18 and 19 until a convergence criterion or maximum number of iterations is reached. We note that a fully
Bayesian approach to NMF is developed using variational inference in [?]. This offers certain potential improvements
over the maximum a posteriori (MAP) solution at the expense of computational speed. The latter
we regard as a very important feature of any algorithm for community detection and the current emphasis of
our work is directed by computational efficiency.
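
The update cycle can be sketched in plain Python as follows. This is an illustrative reading of Equations 17 to 19, not the authors' implementation: it assumes $\mathbf{1}$ acts as an N x N array of ones so that the shapes conform, starts every $\beta_k$ at 1, uses shared scalars a and b, and adds a small `eps` to denominators to avoid division by zero. All function names are hypothetical.

```python
import random

def matmul(A, B):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def ratio(V, W, H, eps):
    """Element-wise V / (W H)."""
    V_hat = matmul(W, H)
    return [[v / (vh + eps) for v, vh in zip(rv, rh)] for rv, rh in zip(V, V_hat)]

def bayesian_nmf_map(V, K, a=1.0, b=2.0, n_iter=200, seed=0, eps=1e-12):
    """Cycle through Equations 17, 18 and 19; returns (W, H, beta)."""
    rng = random.Random(seed)
    N = len(V)
    W = [[rng.random() for _ in range(K)] for _ in range(N)]  # uniform in [0, 1)
    H = [[rng.random() for _ in range(N)] for _ in range(K)]
    beta = [1.0] * K  # assumed starting value
    for _ in range(n_iter):
        # Equation 17: (W^T 1)_kj is the k-th column sum of W.
        WtR = matmul(transpose(W), ratio(V, W, H, eps))
        col_sum = [sum(W[i][k] for i in range(N)) for k in range(K)]
        H = [[H[k][j] * WtR[k][j] / (col_sum[k] + beta[k] * H[k][j] + eps)
              for j in range(N)] for k in range(K)]
        # Equation 18: (1 H^T)_ik is the k-th row sum of H.
        RHt = matmul(ratio(V, W, H, eps), transpose(H))
        row_sum = [sum(H[k]) for k in range(K)]
        W = [[W[i][k] * RHt[i][k] / (row_sum[k] + beta[k] * W[i][k] + eps)
              for k in range(K)] for i in range(N)]
        # Equation 19: closed-form re-estimate of each beta_k.
        beta = [(N + a - 1.0) /
                (0.5 * (sum(W[i][k] ** 2 for i in range(N)) + sum(h * h for h in H[k])) + b)
                for k in range(K)]
    return W, H, beta

# Usage on a toy interaction matrix: two disconnected triangles with degrees on the diagonal.
V = [[2, 1, 1, 0, 0, 0], [1, 2, 1, 0, 0, 0], [1, 1, 2, 0, 0, 0],
     [0, 0, 0, 2, 1, 1], [0, 0, 0, 1, 2, 1], [0, 0, 0, 1, 1, 2]]
W, H, beta = bayesian_nmf_map(V, K=3, n_iter=100)
```

Because every update multiplies a non-negative quantity by a non-negative ratio, W and H stay non-negative throughout, which is the property the text relies on.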

2.4 Probabilistic community membership 

As we may write the observed data as a linear combination of community basis structures and the mixing
fractions are strictly non-negative, the model is identical to a mixture model in which the elements $w_{ik}$ denote
the relative importance of community k in explaining the observed interactions associated with member i.
Under the assumption that the i-th member's interactions are explained by some community memberships, it is
reasonable to define degrees of community membership, $\pi_{ik}$, which sum to unity for each member, as:

$$\pi_{ik} = \frac{w_{ik}}{\sum_{k'} w_{ik'}}. \qquad (20)$$

A greedy community allocation scheme for member i is easily achieved, if desired, by choosing the community
k* which is the argmax of either the $w_{ik}$ or $\pi_{ik}$.
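
Equation 20 and the greedy allocation amount to a row normalisation followed by an argmax. A minimal sketch, with hypothetical values for W and illustrative function names:

```python
def memberships(W):
    """Equation 20: normalise each row of W so that it sums to one."""
    P = []
    for row in W:
        total = sum(row)
        # A node with no weight in any community gets an all-zero row.
        P.append([w / total for w in row] if total > 0 else [0.0] * len(row))
    return P

def greedy_allocation(W):
    """Index k* of the largest entry in each row of W (same argmax as for pi)."""
    return [max(range(len(row)), key=row.__getitem__) for row in W]

W = [[3.0, 1.0, 0.0],   # hypothetical mixing coefficients for two nodes
     [0.5, 0.5, 1.0]]
P = memberships(W)             # [[0.75, 0.25, 0.0], [0.25, 0.25, 0.5]]
k_star = greedy_allocation(W)  # [0, 2]
```

Since the normalisation divides each row by a positive constant, the argmax over $w_{ik}$ and over $\pi_{ik}$ select the same community.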



3 Results 

In this section we demonstrate the performance of NMF-based community detection against a variety of benchmark
problems. We start with a toy network to illustrate the intuition behind our clustering methodology using
graphical examples. Afterwards, we continue by testing our method against artificial problems with observed 
community structure. Finally, we test our method against popular real-world networks of various sizes and 
levels of community cohesiveness. 

3.1 Initialization 

In all the results presented in this paper we allow the maximum number of possible communities K to equal
N, the number of members. The hyper-hyperparameters, a and b, which govern the scale of the shrinkage
hyperparameters, $\beta_k$, are fixed at a = 1, b = 2 so that the $\beta_k$ all have vague distributions over them. The initial
matrices W, H each have elements drawn independently at random from a uniform distribution in the interval
[0,1].

The interaction matrix V we use for NMF is derived from the (weighted) adjacency matrix of each network 
with diagonal elements the strengths of each node (the sum of each row or column of the adjacency matrix). 
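
The construction of V can be sketched as below. This assumes the adjacency matrix has no self-loops, so that a node's strength is the sum of its off-diagonal row entries; the function name and the example matrix are illustrative.

```python
def interaction_matrix(A):
    """Copy the (weighted) adjacency matrix A and put each node's strength on the diagonal."""
    n = len(A)
    V = [list(row) for row in A]
    for i in range(n):
        # Strength of node i: total weight of its links to other nodes.
        V[i][i] = sum(A[i][j] for j in range(n) if j != i)
    return V

# Hypothetical weighted graph on three nodes.
A = [[0, 2, 1],
     [2, 0, 0],
     [1, 0, 0]]
V = interaction_matrix(A)  # [[3, 2, 1], [2, 2, 0], [1, 0, 1]]
```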

3.2 An Illustrative Example

Consider the simple toy graph of Fig. 3 with N = 16 nodes and M = 25 edges of varying weights. We
extract the mesoscopic (community) structure of this network using NMF, along with the popular Extremal
Optimisation (EO) [10], Spectral Partitioning (SP) [25] and Weighted Clique Percolation Method (wCPM)
[?].




Figure 3: An undirected weighted toy graph with 16 nodes. Each pair of nodes has a different interaction 
strength as denoted by the different lines. 

Although a trivial problem at first glance, each community detection method we applied yielded different
modules and node allocations, as seen in Fig. 4. Hard-partitioning methods such as EO and SP produce such
inconsistencies mainly due to the 'broker' nature of nodes such as 6, 9 or 10, which lie on high-flow paths in the
network, making them difficult to assign to one module or the other [12]. Although this issue is addressed by
wCPM, which allows node membership to multiple modules, it does not provide some measure of 'participation
strength' or 'degree of belief in membership'.

In NMF, communities are viewed as basis structures, captured in the model as the K columns of our basis
matrix W (see Section 2). In this framework, we consider each basis structure or community k to have a total
binding energy that is allocated to the atoms (nodes) based on $w_k \in \mathbb{R}_+^{N \times 1}$. For example, in Figure 5 we take
a column of W and draw a colormap (left frame) based on the intensity of its elements. Components with
non-zero energy correspond to nodes that participate in such a basis structure and form a subset of the whole
network (right frame). We can see that this basis community in Fig. 5 is dominated by nodes 6, 7 and 8, which
contribute most of the binding energy, while the peripheral nodes 4, 5, 9, 10 have some minor participation.

We applied NMF to our synthetic graph, where we extracted K* = 4 communities (bases to which at least
one member is allocated) as seen from the four numbered plates in Figure 6. For illustrative purposes, we
assigned nodes to communities using greedy allocation, i.e. we put the individual in the community to which
it contributes most of its energy. The contribution of each atom i to the total binding energy of each structure
k, as denoted by $w_k \in \mathbb{R}_+^{N \times 1}$, can be viewed as the individual's degree of participation or, when normalised,
probability of membership to that community. From our example, in Figure 7 we show the different membership
distributions for four different nodes in the graph. In accordance with our intuition, we see that node 6, which acts
as a mediator between different communities, has a more entropic membership distribution while nodes such as
4 or 14 have more confident assignments.
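
The notion of a 'more entropic' membership distribution can be made concrete with the Shannon entropy, here in bits. The two distributions below are hypothetical and only contrast a confident allocation with a maximally uncertain one over four communities.

```python
import math

def membership_entropy(p):
    """Shannon entropy in bits of a membership distribution p (entries sum to one)."""
    return -sum(x * math.log2(x) for x in p if x > 0)

confident = [1.0, 0.0, 0.0, 0.0]     # hypothetical node firmly in one community
mediator = [0.25, 0.25, 0.25, 0.25]  # hypothetical node shared equally by four

h_confident = membership_entropy(confident)  # zero bits
h_mediator = membership_entropy(mediator)    # two bits, the maximum for four communities
```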

We also can view real-world systems, such as social networks, under the framework we described above:
groups of individuals are structures bound together with a given energy (time spent together, genetic relatedness,
tendency for cooperation, etc). Every individual contributes to a range of communities a certain amount of such
energy, which can be also seen as their degree of membership. High-energy members can be regarded as focal
individuals in a group, while social structures with members of uniform contribution can be regarded as teams
that are held together because of equal participation of their members. Finally, under this framework we can
identify highly social individuals that belong to many groups with a high amount of participation.

Figure 4: Node allocations to communities for three different community detection methodologies.

Figure 5: One of the NMF basis structures, as extracted from our toy graph. Each atom has a different degree
of participation, as can be seen from the colourmap on the left. Node 6 is a focal individual, contributing the
most energy to the structure along with nodes 7 and 8, while nodes 4, 5, 9 and 10 are peripheral.

Figure 6: The community structure of our toy graph, where each coloured plate represents a different basis
structure. For purposes of illustration, we assigned each node to the community with the highest membership
probability.

Figure 7: For each node of the toy graph our method gives a probability membership distribution; on the
horizontal axis we enumerate each community appearing in Fig. 6 and each bar represents the likelihood that
node i belongs to the respective module. For nodes that weakly communicate with other groups (such as 4 and
14) we see confident allocations, while individuals that lie on the boundary between communities (such as 6)
have a more entropic membership distribution.

Having used a simple graph to illustrate the intuition behind NMF-based community detection, we proceed 
in the following section to demonstrate its performance on artificial problems of larger scale and complexity. 



3.3 Benchmark datasets 

A very popular evaluation methodology for a community detection algorithm is to test it against an artificial
network with "observed" community structure and measure how well the algorithm extracts the underlying
mesoscopic organisation. The partition quality is usually compared with the original using the popular Normalised
Mutual Information (NMI) criterion [?].
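
For two hard partitions, NMI can be computed from label counts alone. The sketch below assumes the common normalisation 2 I(A;B) / (H(A) + H(B)); the text does not state which normalisation is used, so this choice is an assumption, as is the function name.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalised mutual information 2 I(A;B) / (H(A) + H(B)) of two hard partitions."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    mi = sum(c / n * math.log(c * n / (count_a[x] * count_b[y]))
             for (x, y), c in joint.items())
    if h_a + h_b == 0:  # both partitions put every node in a single group
        return 1.0
    return 2 * mi / (h_a + h_b)

same_up_to_relabelling = nmi([0, 0, 1, 1], [1, 1, 0, 0])  # close to 1
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])             # close to 0
```

The score is invariant to relabelling of communities, which is why it suits comparisons against a planted partition.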

We start with arguably the most popular type of benchmark problem: realisations of the Newman-Girvan
random graph [14] (NG graph). We generate networks with N = 128 nodes and C = 4 communities with
n = 32 nodes each, where each node has an average degree of $\langle k \rangle = 16$. By manipulating the expected inter-community
degree $\langle k_{out} \rangle$ of nodes we test our algorithm against various levels of community cohesiveness.
As seen in Figures 8 and 9, NMF produces state-of-the-art performance in extracting the original modules for
any degree of fuzziness in the artificial network, outperforming the popular Spectral Partitioning and Hierarchical
Clustering (complete linkage, angular distance) methods and having similar performance to Extremal
Optimisation.
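
One common way to realise such a benchmark is to draw each within-community and between-community link independently, with probabilities chosen so that the expected degree is 16 and the expected inter-community degree is $\langle k_{out} \rangle$. The sketch below follows that construction; the function name, the seed handling and the exact sampling scheme are assumptions rather than details given in the text.

```python
import random

def newman_girvan_graph(k_out, n_groups=4, group_size=32, k_avg=16, seed=0):
    """Return (A, labels): a binary adjacency matrix and the planted community labels."""
    rng = random.Random(seed)
    n = n_groups * group_size
    labels = [i // group_size for i in range(n)]
    # Each node has group_size - 1 possible internal and n - group_size external partners.
    p_in = (k_avg - k_out) / (group_size - 1)
    p_out = k_out / (n - group_size)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            p = p_in if labels[i] == labels[j] else p_out
            if rng.random() < p:
                A[i][j] = A[j][i] = 1  # undirected link
    return A, labels

A, labels = newman_girvan_graph(k_out=4)
mean_degree = sum(map(sum, A)) / len(A)
```

Raising `k_out` towards 8 and beyond makes the planted communities progressively harder to recover.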



[Figure 8 plot: Normalised Mutual Information across different values of $\langle k_{out} \rangle$; curves for EO, NMF and Spectral.]

Figure 8: Normalised Mutual Information and modularity across different levels of community cohesion in
a Newman-Girvan random network. We compare our method (NMF) against other popular community detection
methodologies such as Extremal Optimization (EO), Spectral Partitioning (Spectral) and Hierarchical
Clustering (Hierarchical).

[Figure 9 plot: Membership distribution entropy across different values of $\langle k_{out} \rangle$.]

Figure 9: Mean entropy (in bits) of NMF node membership probabilities for decreasing levels of community
cohesion in a Newman-Girvan random graph. We notice that NMF can describe the ever-increasing fuzziness
of the NG graph in terms of decreasing node allocation confidence.



Although the NG graph is a very popular benchmark problem, it has been heavily criticised [?] for not
reflecting the properties of real-world networks; NG graph realisations are small in size, with a fixed number of
communities and fixed community populations, while the degree distributions are uniform. For those reasons,
Lancichinetti and Fortunato proposed a new class of benchmark problems [20] (which we shall refer to
as LF graphs) that produce networks of any size, with power-law degree and community size distributions.
The community cohesiveness is controlled by a mixing parameter $\mu_t$ that signifies the expected fraction of
intercommunity links per node. For the case of weighted LF graphs, we have a similar parameter $\mu_w$ that
controls the strength allocation of a node between same-community members and outsiders.

For the purposes of our experiments, we generated a variety of such networks with N = 1000 nodes and
different parameters regarding the average degree $\langle k \rangle$ and the exponents $\gamma_1$, $\gamma_2$ of the degree and community
size distributions. By starting with a small mixing parameter $\mu_t = 0.1$ and for each 0.1-step up to $\mu_t = 0.6$
(from 'clear' to 'fuzzy' community structure), we generate 100 realisations of the LF graph and monitor the
module recognition performance of NMF using the popular normalised mutual information criterion. For the
case of weighted LF graphs, we manipulate both mixing parameters at the same time, therefore $\mu_t = \mu_w$. The
results, both for binary and weighted networks and for different configuration parameters, are shown in Figs. 10
and 11.



3.4 Real-world datasets 

In this section we present the performance of NMF on a variety of popular community detection problems.
For these networks we have no "observed solution", therefore we measure the performance of our algorithm
using the very popular Newman-Girvan modularity Q [27]. In Table 1 we present a list of our datasets, along
with their number of nodes N and edges M. Our algorithm is compared on the same data with the Extremal
Optimisation (EO) [10] and Louvain [5] methods. We note that other methods such as Spectral Partitioning and
Hierarchical Clustering algorithms give significantly worse performance than either NMF, EO or Louvain and
these results are not presented here.

[Figure 10 plot: NMF performance in binary LF graphs.]

Figure 10: Normalised Mutual Information across different levels of community cohesion in binary LF random
networks. We start with a very cohesive network (low mixing parameter $\mu_t$) and proceed by making the network
fuzzier. Each point in the graph represents the average over 100 realisations of an LF graph with the given
parameters. The error bars represent one standard deviation.

For each dataset we ran NMF and EO 100 times with different random initialisations and monitored the
values of modularity Q along with the number K* of extracted communities. As previously detailed, for
NMF initialisation, we assume a possible maximum number of communities, K, equal to the number of nodes,
K = N (which is the maximum possible partition size for any network, though we find that running with
a lower value is preferred, as this reduces computation) and the 'effective' number of communities K* is
then inferred from the data. The Louvain method has a very stable behaviour across different runs so we have
omitted the standard deviation of modularity and community sizes for each dataset. The algorithmic complexity
of our approach is $O(NK)$, as compared to $O(N^2 \log N)$ for EO [5]. We also note that, in practice, as EO
requires stochastic steps, the run times of the two algorithms differ even more significantly. In the majority of
applications, the maximum likely number of communities $K \ll N$ and so our approach can be very efficient
and competitive against the Louvain method.

Figure 11: Normalised Mutual Information across different levels of community cohesion in weighted LF
random networks. Again, we start with a very cohesive network and proceed by making the network fuzzier.
In this case of weighted networks, we have the same mixing parameter for both intercommunity degrees and
strengths ($\mu_t = \mu_w$). Each point in the graph represents the average over 100 realisations of an LF graph with
the given parameters and the error bars represent one standard deviation.

Table 1: Real-world datasets

Dataset                     N       M       weighted?
Dolphins [23]               62      159     no
Books US Politics [?]       105     441     no
Les Miserables [?]          77      254     yes
College Football [14]       115     613     no
Jazz Musicians [15]         198     2742    no
C. elegans metabolic [?]    453     2025    no
Network Science [27]        1589    2742    yes
Facebook Caltech [?]        769     16656   no



Table 2: Modularity results against Extremal Optimisation and Louvain method

Dataset                 NMF            EO             Louvain
Dolphins                0.47 ± 0.03    0.51 ± 0.01    0.52
Books US Politics       0.52 ± ε       0.48 ± 0.01    0.50
Les Miserables          0.53 ± 0.02    0.53 ± 0.01    0.57
College Football        0.60 ± ε       0.58 ± 0.01    0.60
Jazz Musicians          0.43 ± 0.01    0.42 ± 0.01    0.44
C. elegans metabolic    0.36 ± 0.01    0.40 ± 0.09    0.43
Network Science         0.83 ± 0.01    0.86 ± 0.01    0.95
Facebook Caltech        0.38 ± 0.01    0.37 ± 0.01    0.37

Table 3: NMF community sizes compared to Extremal Optimisation and Louvain method

Dataset                 NMF              EO               Louvain
Dolphins                6.67 ± 0.83      4 ± 0            5
Books US Politics       6.23 ± 0.62      4.04 ± 0.4       3
Les Miserables          9.97 ± 0.78      4.96 ± 1.72      6
College Football        8.86 ± 0.79      8 ± 0            10
Jazz Musicians          8.57 ± 8.89      4 ± 0            4
C. elegans metabolic    15.69 ± 1.14     7.96 ± 1.06      10
Network Science         342.53 ± 5.28    58.24 ± 12.36    418
Facebook Caltech        24.28 ± 1.72     6.84 ± 1.82      10

In Table 2 we present our experimental results for each dataset of Table 1. We use the popular Newman-Girvan
modularity Q as a performance measure of partition quality and we also present the number of identified
communities. As Q cannot account for the overlapping nature of communities, we use 'greedy allocation', i.e.
we assign a node to the module with the highest probability of membership. The results of NMF are presented
alongside the very popular Extremal Optimisation and Louvain methods for comparative analysis. From Table
2 we can see that our approach performs competitively even though it is not an algorithm designed with the aim of maximising
modularity, unlike either the EO or the Louvain method. Additionally, it has the advantage of providing
probabilistic outputs for community membership (therefore achieving soft partitioning) and having low computational
overhead. Finally, NMF does not suffer from the resolution limit [13] of modularity optimisation
methods such as EO, where smaller groups are merged together [?] [13], ending up with a smaller number of
communities, as seen in Table 3.
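
For reference, the modularity of a hard partition (such as the one obtained by greedy allocation) can be computed as in the sketch below, which uses the standard Newman-Girvan definition for an undirected, possibly weighted, graph without self-loops. The function name and the toy graph are illustrative.

```python
def modularity(A, labels):
    """Newman-Girvan modularity Q of a hard partition of an undirected graph."""
    n = len(A)
    strength = [sum(row) for row in A]
    two_m = sum(strength)  # twice the total link weight
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                # Observed weight minus the weight expected under the null model.
                q += A[i][j] - strength[i] * strength[j] / two_m
    return q / two_m

# Toy graph: two disconnected triangles.
A = [[0, 1, 1, 0, 0, 0], [1, 0, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0],
     [0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 0, 1], [0, 0, 0, 1, 1, 0]]
q_split = modularity(A, [0, 0, 0, 1, 1, 1])   # one community per triangle
q_merged = modularity(A, [0, 0, 0, 0, 0, 0])  # everything in one community
```

Splitting the graph into its two triangles gives Q = 0.5, while merging all nodes gives Q = 0.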



4 Conclusions 



In this work we described a novel approach to community detection that adopts the Bayesian non-negative
matrix factorisation model of [33] to achieve soft-partitioning of the network, assigning each node a probability
of membership over all the extracted communities. That allows us not only to capture the fuzziness of the
network (via the entropy of the membership distribution) but also to improve network cartography techniques
[16] by identifying central and peripheral nodes in modules. Network visualization tools can also be improved
in this manner. The approach is computationally efficient and offers performance comparable to state-of-the-art
methods. Indeed the performance advantages for large data sets allow the NMF approach to be run many times
in the time taken by a single run of competing approaches. Clearly this allows for the selection of the best performing
run, or a small ensemble of high-modularity solutions.



5 Future work 

Future application work in this area addresses the analysis of a large zoological data set of interactions between
members of a population of wild birds. As significant data exists and secondary verification data has been
collated, such as breeding pair identifications, this offers a unique chance to verify any relationships that
our approach detects.

Work is currently underway to allow this model to be extended to incorporate dynamics such that non-stationary,
time-varying community relationships may be tracked. We have not discussed the handling of
missing data in this paper, but missing observations may be readily handled in our approach
and this is detailed in [?]. As mentioned in the paper, more complex priors over the set of $\beta_k$ would allow for
domain knowledge to be incorporated and for correlated structures to be correctly dealt with.